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We describe some recent approaches to likelihood based inference in the presence of nuisance parameters. 
Our approach is based on plotting the likelihood function and the p-value function, using recently developed 
third order approximations. Orthogonal parameters and adjustments to profile likelihood are also discussed. 
Connections to classical approaches of conditional and marginal inference are outlined. 



1. INTRODUCTION 

We take the view that the most effective form of 
inference is provided by the observed likelihood func- 
tion along with the associated p-value function. In 
the case of a scalar parameter the likelihood func- 
tion is simply proportional to the density function. 
The p-value function can be obtained exactly if there 
is a one-dimensional statistic that measures the pa- 
rameter. If not, the p-value can be obtained to a 
high order of approximation using recently developed 
methods of likelihood asymptotics. In the presence 
of nuisance parameters, the likelihood function for a 
(one-dimensional) parameter of interest is obtained 
via an adjustment to the profile likelihood function. 
The p-value function is obtained from quantities com- 
puted from the likelihood function using a canonical 
parametrization tp — ip(0), which is computed locally 
at the data point. This generalizes the method of 
eliminating nuisance parameters by conditioning or 
marginalizing to more general contexts. In Section 
2 we give some background notation and introduce 
the notion of orthogonal parameters. In Section 3 we 
illustrate the p-value function approach in a simple 
model with no nuisance parameters. Profile likelihood 
and adjustments to profile likelihood are described in 
Section 4. Third order p- values for problems with nui- 
sance parameters are described in Section 5. Section 
6 describes the classical conditional and marginal like- 
lihood approach. 



2. NOTATION AND ORTHOGONAL 
PARAMETERS 

We assume our measurement (s) y can be modelled 
as coming from a probability distribution with density 
or mass function f(y; 9), where 9 = (tp, A) takes values 
in R d . We assume tp is a one-dimensional parameter 
of interest, and A is a vector of nuisance parameters. 
If there is interest in more than one component of 
0, the methods described here can be applied to each 
component of interest in turn. The likelihood function 
is 



it is defined only up to arbitrary multiples which may 
depend on y but not on 9. This ensures in particu- 
lar that the likelihood function is invariant to one-to- 
one transformations of the measurement (s) y. In the 
context of independent, identically distributed sam- 
pling, where y = (yi, . . . , y n ) and each g/j follows the 
model f(y; 9) the likelihood function is proportional 
to n/(j/i; 9) and the log- likelihood function becomes a 
sum of independent and identically distributed com- 
ponents: 



£(9) = £(9;y) = mogf( yi ;9) + a(y). 



(2) 



The maximum likelihood estimate 9 is the value of 
9 at which the likelihood takes its maximum, and in 
regular models is defined by the score equation 



£'(9;y) = 0. 



(3) 



The observed Fisher information function j(9) is the 
curvature of the log-likelihood: 



(4) 



and the expected Fisher information is the model 
quantity 



i{9)=E{-£"{9)} 



-£"(6;y)f(y;d)dy. (5) 



If y is a sample of size n then i(ff) = 0(n). 

In accord with the partitioning of 9 we partition the 
observed and expected information matrices and use 
the notation 



and 



-\9) 



(G) 



(7) 



L(9) = L(9;y)=c(y)f(y;9); 



(1) 



We say ip is orthogonal to A (with respect to expected 
Fisher information) if i^\(9) — 0. When ip is scalar 
a transformation from [tp, A) to (ip, 7]{ip, A)) such that 
tp is orthogonal to rj can always be found (Cox and 
Reid, [1]). The most directly interpreted consequence 



THAT001 



2 



PHYSTAT2003, SLAG, September 8-11, 2003 



of parameter orthogonality is that the maximum like- 
lihood estimates of orthogonal components are asymp- 
totically independent. 

Example 1: ratio of Poisson means Suppose 
yi and 2/2 are independent counts modelled as Poisson 
with mean A and tpX, respectively. Then the likelihood 
function is 

L(ip, A; 2/1,2/2) = e - x{1+ ^i,y-\y^ 

and ip is orthogonal to ^(-i/;, A) = X(ip + 1). In 
fact in this example the likelihood function factors as 
Li(tp)L 2 (ri), which is a stronger property than param- 
eter orthogonality. The first factor is the likelihood for 
a binomial distribution with index j/i 4- 2/2 and proba- 
bility of success "0/(1 + VOi and the second is that for 
a Poisson distribution with mean 77. 

Example 2: exponential regression Sup- 
pose yi,i = 1, ...,n are independent observations, 
each from an exponential distribution with mean 
Aexp(— ipXi), where Xi is known. The log-likelihood 
function is 

A; y) = -n log A + tpT,Xi - X^ 1 ^ exp(tpXi) (8) 

and i^\{9) = if and only if Sx^ = 0. The stronger 
property of factorization of the likelihood docs not 
hold. 

3. LIKELIHOOD INFERENCE WITH NO 
NUISANCE PARAMETERS 

We assume now that is one-dimensional. A plot 
of the log-likelihood function as a function of 9 can 
quickly reveal irregularities in the model, such as a 
non-unique maximum, or a maximum on the bound- 
ary, and can also provide a visual guide to deviance 
from normality, as the log-likelihood function for a 
normal distribution is a parabola and hence symmet- 
ric about the maximum. In order to calibrate the 
log-likelihood function we can use the approximation 

r{6) = sign((9 - O)[2{£(0) - £(9)}} 1/2 ~ N(0, 1), (9) 

which is equivalent to the result that twice the log like- 
lihood ratio is approximately \\- This will typically 
provide a better approximation than the asymptoti- 
cally equivalent result that 

9-e ^ N(o,r\e)) (10) 

as it partially accommodates the potential asymme- 
try in the log-likelihood function. These two approx- 
imations are sometimes called first order approxima- 
tions because in the context where the log-likelihood 
is 0(n), we have (under regularity conditions) results 
such as 

Pr{r(<9; y) < r(9; y")} = Pr{Z < r(9; y")} (11) 

{l + Oin- 1 ' 2 )} 



Table I The p-values for testing fi — 0, i.e. that the 
number of observed events is consistent with the 
background. 

upper p-value 0.0005993 
lower p-value 0.0002170 
mid p-value 0.0004081 
<E>(r*) 0.0003779 
<E>(r) 0.0004416 
${(0 - 0)j 1/2 } 0.0062427 



where Z follows a standard normal distribution. It 
is relatively simple to improve the approximation to 
third order, i.e. with relative error 0(n~ 3 / 2 ), using 
the so-called r* approximation 

r*(9) = r(9) + {l/r(9)} \og{q(9)/r(9)} ~ N(0, 1) 

(12) 

where q(9) is a likelihood-based statistic and a gener- 
alization of the Wald statistic (6— 6*)? 1//2 (^); see Fraser 
[2]- 

Example 3: truncated Poisson 

Suppose that y follows a Poisson distribution with 
mean 6 = b + ji, where 6 is a background rate that 
is assumed known. In this model the p-value function 
can be computed exactly simply by summing the Pois- 
son probabilities. Because the Poisson distribution is 
discrete, the p-value could reasonably be defined as 
either 

Pr(y<y°;9) (13) 

or 

Pr(y<y°;9), (14) 

sometimes called the upper and lower p- values, respec- 
tively. 

For the values y° = 17, b = 6.7, Figure 1 shows 
the likelihood function as a function of /1 and the p- 
value function p(/i) computed using both the upper 
and lower p-values. In Figure 2 we plot the mid p- 
value, which is 

Prfo < y°) + (l/2)Pr(i/ = y°). (15) 

The approximation based on r* is nearly identical to 
the mid-p- value; the difference cannot be seen on Fig- 
ure 2. Table 1 compares the p-values at = 0. This 
example is taken from Fraser, Reid and Wong [3]. 

4. PROFILE AND ADJUSTED PROFILE 
LIKELIHOOD FUNCTIONS 

We now assume 9 — (ip,X) and denote by X^ the 
restricted maximum likelihood estimate obtained by 
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Figure 1: The likelihood function (top) and p- value 
function (bottom) for the Poisson model, with b — 6.7 
and y° = 17. For fi — the p-value interval is 
(0.99940,0.99978). 
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Figure 2: The upper and lower p- value functions and the 
mid-p- value function for the Poisson model, with b — 6.7 
and y° = 17. The approximation based on $(r*) is 
identical to the mid-p-value function to the drawing 
accuracy. 



maximizing the likelihood function over the nuisance 
parameter A with -0 fixed. The profile likelihood func- 
tion is 

L p ^)=L{^X)- (16) 

also sometimes called the concentrated likelihood or 
the peak likelihood. The approximations of the pre- 
vious section generalize to 

rty) = sign(^ - i>)[2{£ p (4>) ~ W}] 1/2 ~ N(0, 1), 

(17) 

and 

^-^{^(fl)}- 1 ). (18) 

These approximations, like the ones in Section 3, are 
derived from asymptotic results which assume that 
n — ► oo, that we have a vector y of independent, iden- 
tically distributed observations, and that the dimen- 
sion of the nuisance parameter does not increase with 
n. Further regularity conditions are required on the 
model, such as are outlined in textbook treatments 
of the asymptotic theory of maximum likelihood. In 
finite samples these approximations can be mislead- 
ing: profile likelihood is too concentrated, and can be 
maximized at the 'wrong' value. 

Example 4: normal theory regression Suppose 
jji = x'j3 + ti, where Xi = (xn, . . . , Xi P ) is a vector of 
known covariate values, (3 is an unknown parameter 
of length p, and e, is assumed to follow a N(0,t(>) 
distribution. The maximum likelihood estimate of i\j 
is 

V3 = is( y< - x'J) 2 (19) 

which tends to be too small, as it does not allow for 
the fact that p unknown parameters (the components 
of P) have been estimated. In this example there is 
a simple improvement, based on the result that the 
likelihood function for (/3, ip) factors into 

L-iifl, i>; y)L 2 {4>', — a;-/3) 2 } (20) 

where L2(ip) is proportional to the marginal distri- 
bution of Yj(yi — x^P) 2 . Figure 3 shows the profile 
likelihood and the marginal likelihood; it is easy to 
verify that the latter is maximized at 

?L = S(y, " X'J) 2 (21) 

n — p 

which in fact is an unbiased estimate of ip. 

Example 5: product of exponential means 

Suppose we have independent pairs of observations 
Vu, Viu where y u ~ Exp(ip\i) y 2l ~ Exp(tp/Xi),i = 
1, ... ,n. The limiting normal theory for profile likeli- 
hood does not apply in this context, as the dimension 
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Figure 3: Profile likelihood and marginal likelihood for 
the variance parameter in a normal theory regression 
with 21 observations and three covariates (the "Stack 
Loss" data included in the Splus distribution). The 
profile likelihood is maximized at a smaller value of tf>, 
and is narrower; in this case both the estimate and its 
estimated standard error are too small. 
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of the parameter is not fixed but increasing with the 
sample size, and it can be shown that 

j, - ^ (22) 

as n — > oo (Cox and Reid [4]). 

The theory of higher order approximations can be 
used to derive a general improvement to the profile 
likelihood or log-likelihood function, which takes the 
form 

4W0 = + \ log IjaaOA, A*)| + (23) 

where j\\ is defined by the partitioning of the ob- 
served information function, and B(ip) is a further 
adjustment function that is O p (l). Several versions of 
B(ip) have been suggested in the statistical literature: 
we use the one defined in Fraser [5] given by 

S(VO = ~ log \<f/ x y>, X^j^tf, A)^(V, A^)|. (24) 

This depends on a so-called canonical parametrization 
ip = ip(9) = l;v{8;y°) which is discussed in Fraser, 
Reid and Wu [6] and Reid [7]. 

In the special case that ip is orthogonal to the nui- 
sance parameter A a simplification of £ a (ip) is available 
as 

tcpm = i P W - ilogliAAMA*)! (25) 

THAT001 



which was first introduced in Cox and Reid (1987). 
The change of sign on log | j | comes from the or- 
thogonality equations. In i.i.d. sampling, £ p (ip) is 
O p (n), i.e. is the sum of n bounded random variables, 
whereas log\j\ is O p (l). A drawback of Icr is that 
it is not invariant to one-to-one reparametrizations 
of A, all of which are orthogonal to ip. In contrast 
£ a {ip) is invariant to transformations 9 — (ip,X) to 
9' = (ip, r](ip, A)), sometimes called interest-respecting 
transformations . 

Example 5 continued In this example ip is or- 
thogonal to A = (Ai, . . . , A„), and 

tcR(ip) = -(in/2) log?/; - (2/xP)^(y u y 2i ). (26) 

The value that maximizes ten is 'more nearly con- 
sistent' than the maximum likelihood estimate as 
tpCR — > (tt/3)^. 

5. F-VALUES FROM PROFILE 
LIKELIHOOD 

The limiting theory for profile likelihood gives first 
order approximations to p- values, such as 

p(il>) = *(r p ) (27) 

and 

pW) = H(i>-^)jp /2 W} (28) 

although the discussion in the previous section sug- 
gests these may not provide very accurate approxima- 
tions. As in the scalar parameter case, though, a much 
better approximation is available using $(r*) where 

r*(VO = r p (VO + l/{r P (^)}log{Q(V)/r p (V)} (29) 

where Q can also be derived from the likelihood func- 
tion and a function tp(8,y°) as 

Q = (0- v^a- 1 ' 2 

where 

= e>(0) , 

= ip v >0^)/\ip<p'0i>)\ > 

°l = |j(AA)(^)l/|j(M)(^)l , 

\j m (0)\ = \jee(9)\W(9)\- 2 , 

Ij ( aa)(^)I = \3xx0i,)\\<py0i,)\~ 7 ■ 

The derivation is described in Fraser, Reid and 
Wu [6] and Reid [7]. The key ingredients are the 
log-likelihood function £(9) and a reparametrization 
<p(0) — <p{0; y ), which is defined by using an approx- 
imating model at the observed data point y°; this ap- 
proximation in turn is based on a conditioning argu- 
ment. A closely related approach is due to Barndorff- 
Nielsen; see Barndorff-Nielsen and Cox [8, Ch. 7], and 
the two approaches are compared in [7] . 
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Table II Employment of men and women at the Space 
Telescope Science Institute, 1998-2002 (from Science 
magazine, Volume 299, page 993, 14 February 2003). 



Left Stayed Total 



Men 1 
Women 5 



18 
2 



19 
7 



Total 



(i 



20 



26 



Figure 4: The p- value function for the log-odds ratio, ip, 
for the data of Table II. The value tp = corresponds to 
the hypothesis that the probabilities of leaving are equal 
for men and women. 



Example 6: comparing two binomials Table 2 
shows the employment history of men and women at 
the Space Telescope Science Institute, as reported in 
Science Feb 14 2003. We denote by y\ the number 
of males who left and model this as a Binomial with 
sample size 19 and probability p±; similarly the num- 
ber of females who left, y^, is modelled as Binomial 
with sample size 7 and probability p2- We write the 
parameter of interest 



tp = log 



P1O--P2) 
P2O--P1)' 



(30) 



The hypothesis of interest is pi — P2, or ip = 0. The p- 
value function for ip is plotted in Figure 4. The p- value 
at ip = is 0.00028 using the normal approximation 
to r p , and is 0.00048 using the normal approximation 
to r*. Using Fisher's exact test gives a mid p- value 
of 0.00090, so the approximations are anticonservative 
in this case. 



Example 7: Poisson with estimated back- 
ground Suppose in the context of Example 3 that 
we allow for imprecision in the background, replacing 
b by an unknown parameter (3 with estimated value 0. 
We assume that the background estimate is obtained 
from a Poisson count x, which has mean k(3, and the 
signal measurement is an independent Poisson count, 
y, with mean /3+/I. We have (3 — x/k and var/3 = (3/ k, 
so the estimated precision of the background gives us 
a value for k. For example, if the background is es- 
timated to be 6.7 ± 2.1 this implies a value for k of 
6. 7/(2. 1) 2 = 1.5. Uncertainty in the standard error 
of the background is ignored here. We now outline 
the steps in the computation of the r* approximation 

est. 

The log-likelihood function based on the two inde- 
pendent observations x and y is 

£(13, ti)=x ]og(k0) -kf3 + y log(/3 + /x) - - fi (31) 

with canonical parameter <p = (log/3, log(/3 + /x))'. 
Then 



89' 





!/(/? + /*) 



1/(3 
1/09 + /*) 




from which 



Then we have 



ip v 



-13,13 + n). 



X{0) 



. -^log(/3) + (/3 M + M)log(/3 + A) 
-Pn log(/3 M ) + (fl, + n) log(/3 M + fi) 

\j(89)(0)\ = ViV2 = k/(3{(3 + jl) 



li(AA)(^)| 
and finally 



(32) 
(33) 
(34) 

(35) 

(36) 

(37) 
(38) 



{kp30+m 1/2 



(39) 



{W M + /i) 2 + C9 + /i)/9j} 1/2 
The likelihood root is 

r = sign(Q)V[2{£(/3,A)-^^)}] (40) 
= sign(Q)V(2[fc/31og{/3//3, I }) + + fi) 
log{(/3 + A)/(/3 M + M)} 
-k0 - fa + 0,, + »)}}). (41) 
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The third order approximation to the p- value function 
is 1 — <f>(r*), where 



r*=r+(l/r) log(Q/r). 



(42) 



Figure 5 shows the p- value function for fi using the 
mid-p- value function from the Poisson with no adjust- 
ment for the error in the background, and the p-value 
function from 1 — $(r*). The p- value for testing /i = 
is 0.00464, allowing for the uncertainty in the back- 
ground, whereas it is 0.000408 ignoring this uncer- 
tainty. 

The hypothesis Ey = (3 could also be tested by 
modelling the mean of y as v(3, say, and testing the 
value v = 1. In this formulation we can eliminate 
the nuisance parameter exactly by using the binomial 
distribution of y conditioned on the total x + y, as 
described in example 1. This gives a mid-p- value of 
0.00521. The computation is much easier than that 
outlined above, and seems quite appropriate for test- 
ing the equality of the two means. However if infer- 
ence about the mean of the signal is needed, in the 
form of a point estimate or confidence bounds, then 
the formulation as a ratio seems less natural at least 
in the context of HEP experiments. A more complete 
comparison of methods for this problem is given in 
Linncmann [8]. 



6. CONDITIONAL AND MARGINAL 
LIKELIHOOD 

In special model classes, it is possible to elimi- 
nate nuisance parameters by either conditioning or 
marginalizing. The conditional or marginal likelihood 
then gives essentially exact inference for the parame- 
ter of interest, if this likelihood can itself be computed 
exactly. In Example 1 above, L\ is the density for y 2 
conditional on y\ + 2/2, so is a conditional likelihood 
for ip. This is an example of the more general class of 
linear exponential families: 

/(y; V, A) = eMMy)+Xt(y)~c(^, X)-d(y)}; (43) 

in which 

fcond{s \t;4>) = exp{^s - C t (ii>) - D t {s)} (44) 

defines the conditional likelihood. The comparison of 
two binomials in Example 6 is in this class, with ip 
as defined at (30) and A = log{p 2 /(l — ^2)}- The 
difference of two Poisson means, in Example 7, can- 
not be formulated this way, however, even though the 
Poisson distribution is an exponential family, because 
the parameter of interest ijj is not a component of the 
canonical parameter. 

It can be shown that in models of the form (|43|l 
the log-likelihood = e p (ip) + (l/2)log|jAA| ap- 

proximates the conditional log- likelihood £ C ond(tp) = 



log fcond(s \t;i/)), and that 

p(ip) = $(r*) (45) 

where 

r = r a H log( — ) 

r a r a 

r a = ±[2{4W>a)-40A)}] 1/2 

Q = (V3 a -^){jaW} 1/2 

approximates the p-value function with relative error 
0(n -3 / 2 ) in i.i.d. sampling. An asymptotically equiv- 
alent approximation based on the profile log-likelihood 
is 



where 



p(V0 = $(r*) 



r p + — log(— ) 



(46) 



r p = ±[2{^(V3)-£ P (V)}] 1/2 



i,2 \jxxWM\ 1/2 



bAx(^,A)|Va 

In the latter approximation an adjustment for nui- 
sance parameters is made to Q, whereas in the former 
the adjustment is built into the likelihood function. 
Approximation (46) was used in Figure 3. 

A similar discussion applies to the class of transfor- 
mation models, using marginal approximations. Both 
classes are reviewed in Reid [9] . 
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Figure 5: Comparison of the p-value functions computed assuming the background is known and using the mid-p-value 
with the third order approximation allowing a background error of ±1.75. 
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